15 research outputs found

    Throughput and robustness of bioinformatics pipelines for genome-scale data analysis

    The post-genomic era has been heavily influenced by the rapid development of high-throughput molecular-screening technologies, which has enabled genome-wide analysis approaches on an unprecedented scale. The constantly decreasing cost of producing experimental data resulted in a data deluge, which has led to technical challenges in distributed bioinformatics infrastructure and computational biology methods. At the same time, the advances in deep-sequencing allowed intensified interrogation of human genomes, leading to prominent discoveries linking our genetic makeup with numerous medical conditions. The fast and cost-effective sequencing technology is expected to soon become instrumental in personalized medicine. The transition of the methodology related to genome sequencing and high-throughput data analysis from the research domain to a clinical service is challenging in many aspects. One of them is providing medical personnel with accessible, robust, and accurate methods for analysis of sequencing data. The computational protocols used for analysis of the sequencing data are complex, parameterized, and in continuous development, making results of data analysis sensitive to factors such as the software used and the parameter values selected. However, the influence of parameters on results of computational pipelines has not been systematically studied. To fill this gap, we investigated the robustness of a genetic variant discovery pipeline against changes of its parameter settings. Using two sensitivity screening methods, we evaluated parameter influence on the identified genetic variants, and found that the parameters have irregular effects and are inter-dependent. Only a fraction of parameters were identified to have considerable impact on the results, suggesting that screening parameter sensitivity can lead to simpler pipeline configuration. 
Our results showed that, although a simple metric can be used to examine parameter influence, more informative results are obtained using a criterion related to the accuracy of pipeline results. Using the results of sensitivity screening, we have shown that the influential pipeline parameters can be adjusted to effectively increase the accuracy of variant discovery. Such information is invaluable for researchers tuning pipeline parameters, and can guide the search for optimal settings for computational pipelines in a clinical setting. Contrasting the two applied screening methods, we learned more about specific requirements of robustness analysis of computational methods, and were able to suggest a more tailored strategy for parameter screening. Our contributions demonstrate the importance and the benefits of systematic robustness analysis of bioinformatics pipelines, and indicate that more efforts are needed to advance research in this area. Web services are commonly used to provide interoperable, programmatic access to bioinformatics resources, and consequently, they are natural building blocks of bioinformatics analysis workflows. However, in the light of the data deluge, their usability for data-intensive applications has been questioned. We investigated applicability of standard Web services to high-throughput pipelines, and showed how throughput and performance of such pipelines can be improved. By developing two complementary approaches that take advantage of established and proven optimization mechanisms, we were able to enhance Web service communication in a non-intrusive manner. The first strategy increases throughput of Web service interfaces by a stream-like invocation pattern. This additionally allows for data-pipelining between consecutive steps of a workflow. The second approach facilitates peer-to-peer data transfer between Web services to increase the capacity of the workflow engine. 
We evaluated the impact of the enhancements on genome-scale pipelines, and showed that high-throughput data analysis using standard Web service pipelines is possible when the technology is used sensibly. However, considering the contemporary data volumes and their expected growth, methods capable of handling even larger data should be sought. Systematic analysis of pipeline robustness requires intensive computations, which are particularly demanding for high-throughput pipelines. Providing more efficient methods of pipeline execution is fundamental for enabling such examinations on a large scale. Furthermore, the standardized interfaces of Web services facilitate automated executions, and are perfectly suited for coordinating large computational experiments. I speculate that, given wide adoption of Web service technology in bioinformatics pipelines, large-scale quality control studies, such as robustness analysis, could be automated and performed routinely on newly published computational methods. This work contributes to realizing such a conception, providing a technical basis for building the necessary infrastructure and suggesting a methodology for robustness analysis.

    Data partitioning enables the use of standard SOAP Web Services in genome-scale workflows

    Biological databases and computational biology tools are provided by research groups around the world, and made accessible on the Web. Combining these resources is a common practice in bioinformatics, but integration of heterogeneous and often distributed tools and datasets can be challenging. To date, this challenge has been commonly addressed in a pragmatic way, by tedious and error-prone scripting. Recently, however, a more reliable technique has been identified and proposed as the platform that would tie together bioinformatics resources, namely Web Services. In the last decade, Web Services have spread widely in bioinformatics, and earned the title of recommended technology. However, in the era of high-throughput experimentation, a major concern regarding Web Services is their ability to handle large-scale data traffic. We propose a stream-like communication pattern for standard SOAP Web Services that enables efficient flow of large data traffic between a workflow orchestrator and Web Services. We evaluated the data-partitioning strategy by comparing it with typical communication patterns on an example pipeline for genomic sequence annotation. The results show that data-partitioning lowers resource demands of services and increases their throughput, which in turn makes it possible to execute in-silico experiments at genome scale using standard SOAP Web Services and workflows. As a proof-of-principle, we annotated an RNA-seq dataset using a plain BPEL workflow engine.
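The data-partitioning idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `partition`, `invoke_partitioned`, and `annotate` are hypothetical, and a plain Python function stands in for the SOAP operation. The sketch only shows the general pattern of sending a large dataset to a service in fixed-size batches rather than in one message.

```python
from typing import Callable, Iterable, Iterator, List


def partition(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches holding at most `size` records each."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def invoke_partitioned(
    records: Iterable[str],
    service: Callable[[List[str]], List[str]],
    size: int,
) -> Iterator[str]:
    """Call `service` once per batch and yield results as they arrive."""
    for batch in partition(records, size):
        yield from service(batch)


def annotate(batch: List[str]) -> List[str]:
    # Hypothetical stand-in for a remote annotation operation.
    return [f"{seq}:len={len(seq)}" for seq in batch]
```

Because both functions are generators, only one batch is held in memory at a time, which mirrors the lower resource demand the abstract attributes to partitioning.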

    Direct data transfer between SOAP web services in orchestration

    In scientific data analysis, workflows are used to integrate and coordinate resources such as databases and tools. Workflows are normally executed by an orchestrator that invokes component services and mediates data transport between them. Scientific data are frequently large, and brokering large data increases the load on the orchestrator and reduces workflow performance. To remedy this problem, we demonstrate how plain SOAP web services can be tailored to support direct service-to-service data transport, thus allowing the orchestrator to delegate the data-flow. We formally define a data-flow delegation message, develop an XML schema for it, and empirically analyze the performance improvement of data-flow delegation in comparison with regular orchestration, using an example bioinformatics workflow.

    Novel SLC19A3 Promoter Deletion and Allelic Silencing in Biotin-Thiamine-Responsive Basal Ganglia Encephalopathy.

    BACKGROUND: Biotin-thiamine responsive basal ganglia disease is a severe, but potentially treatable disorder caused by mutations in the SLC19A3 gene. Although the disease is inherited in an autosomal recessive manner, patients with typical phenotypes carrying single heterozygous mutations have been reported. This makes the diagnosis uncertain and may delay treatment. METHODS AND RESULTS: In two siblings with early-onset encephalopathy, dystonia, and epilepsy, whole-exome sequencing revealed a novel single heterozygous SLC19A3 mutation (c.337T>C). Although Sanger sequencing and copy-number analysis revealed no other aberrations, RNA-sequencing in brain tissue suggested the second allele was silenced. Whole-genome sequencing resolved the genetic defect by revealing a novel 45,049 bp deletion in the 5'-UTR region of the gene abolishing the promoter. High-dose thiamine and biotin therapy was started in the surviving sibling, who remains stable. In another patient, two novel compound heterozygous SLC19A3 mutations were found. He improved substantially on thiamine and biotin therapy. CONCLUSIONS: We show that large genomic deletions occur in the regulatory region of SLC19A3 and should be considered in genetic testing. Moreover, our study highlights the power of whole-genome sequencing as a diagnostic tool for rare genetic disorders across a wide spectrum of mutations including non-coding large genomic rearrangements.

    Better safe than sorry-Whole-genome sequencing indicates that missense variants are significant in susceptibility to COVID-19.

    Undoubtedly, genetic factors play an important role in susceptibility and resistance to COVID-19. In this study, we conducted a GWAS analysis. Out of 15,489,173 SNPs, we identified 18,191 significant SNPs for severe and 11,799 SNPs for resistant phenotype, showing that a great number of loci were significant in different COVID-19 representations. The majority of variants were synonymous (60.56% for severe, 58.46% for resistant phenotype) or located in introns (55.77% for severe, 59.83% for resistant phenotype). We identified the most significant SNPs for a severe outcome (in AJAP1 intron) and for COVID resistance (in FIG4 intron). We found no missense variants with a potential causal function on resistance to COVID-19; however, two missense variants were determined as significant for a severe phenotype (in PM20D1 and LRP4 exons). None of the aforementioned SNPs and missense variants found in this study have been previously associated with COVID-19.

    Beyond GWAS—Could Genetic Differentiation within the Allograft Rejection Pathway Shape Natural Immunity to COVID-19?

    COVID-19 infections pose a serious global health concern, so it is crucial to identify the biomarkers for the susceptibility to and resistance against this disease that could help in a rapid risk assessment and reliable decisions being made on patients’ treatment and their potential hospitalisation. Several studies investigated the factors associated with severe COVID-19 outcomes that can be either environmental, population based, or genetic. It was demonstrated that the genetics of the host plays an important role in the various immune responses and, therefore, there are different clinical presentations of COVID-19 infection. In this study, we aimed to use variant descriptive statistics from GWAS (Genome-Wide Association Study) and variant genomic annotations to identify metabolic pathways that are associated with a severe COVID-19 infection as well as pathways related to resistance to COVID-19. For this purpose, we applied a custom-designed mixed linear model implemented in custom-written software. Our analysis of more than 12.5 million SNPs did not indicate any pathway that was significant for a severe COVID-19 infection. However, the Allograft rejection pathway (hsa05330) was significant (p = 0.01087) for resistance to the infection. The majority of the 27 SNPs marking genes constituting the Allograft rejection pathway were located on chromosome 6 (19 SNPs) and the remainder were mapped to chromosomes 2, 3, 10, 12, 20, and X. This pathway comprises several immune system components crucial for the self versus non-self recognition, but also the components of antiviral immunity. Our study demonstrated that not only single variants are important for resistance to COVID-19, but also the cumulative impact of several SNPs within the same pathway matters.

    The Thousand Polish Genomes—A Database of Polish Variant Allele Frequencies

    Although Slavic populations account for over 4.5% of world inhabitants, no centralised, open-source reference database of genetic variation of any Slavic population exists to date. Such data are crucial for clinical genetics, biomedical research, as well as archeological and historical studies. The Polish population, which is homogenous and sedentary in its nature but influenced by many migrations of the past, is unique and could serve as a genetic reference for the Slavic nations. In this study, we analysed whole genomes of 1222 Poles to identify and genotype a wide spectrum of genomic variation, such as small and structural variants, runs of homozygosity, mitochondrial haplogroups, and de novo variants. Common variant analyses showed that the Polish cohort is highly homogenous and shares ancestry with other European populations. In rare variant analyses, we identified 32 autosomal-recessive genes with significantly different frequencies of pathogenic alleles in the Polish population as compared to the non-Finnish Europeans, including C2, TGM5, NUP93, C19orf12, and PROP1. The allele frequencies for small and structural variants, calculated for 1076 unrelated individuals, are released publicly as The Thousand Polish Genomes database, and will contribute to the worldwide genomic resources available to researchers and clinicians.

    Fig 4 -

    Genomic regions that correspond to the location of two missense SNPs with a potential causal function on a severe outcome of COVID-19, with green dots representing the missense SNPs in severe (A) and resistant (B) phenotypes.